loss landscape geometry
Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel
In suitably initialized wide networks, small learning rates transform deep neural networks (DNNs) into neural tangent kernel (NTK) machines, whose training dynamics are well-approximated by a linear weight expansion of the network at initialization. Standard training, however, diverges from its linearization in ways that are poorly understood. We study the relationship between the training dynamics of nonlinear deep networks, the geometry of the loss landscape, and the time evolution of a data-dependent NTK. We do so through a large-scale phenomenological analysis of training, synthesizing diverse measures characterizing loss landscape geometry and NTK dynamics. Across multiple neural architectures and datasets, we find these diverse measures evolve in a highly correlated manner, revealing a universal picture of the deep learning process. In this picture, deep network training exhibits a highly chaotic rapid initial transient that within 2 to 3 epochs determines the final linearly connected basin of low loss containing the end point of training. During this chaotic transient, the NTK changes rapidly, learning useful features from the training data that enable it to outperform the standard initial NTK by a factor of 3 in less than 3 to 4 epochs. After this rapid chaotic transient, the NTK changes at constant velocity, and its performance matches that of full network training in 15% to 45% of training time. Overall, our analysis reveals a striking correlation between a diverse set of metrics over training time, governed by a rapid chaotic-to-stable transition in the first few epochs, that together pose challenges and opportunities for the development of more accurate theories of deep learning.
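To make the central object concrete, here is a minimal sketch (PyTorch assumed; not the paper's code) of the empirical, data-dependent NTK on a batch: the Gram matrix of per-sample output gradients at the network's current weights. Recomputing it during training is how one would track the kernel's evolution and "velocity".

```python
# Minimal sketch (not the paper's code) of the empirical, data-dependent
# NTK on a batch: Theta_ij = <grad_w f(x_i), grad_w f(x_j)> at the
# current weights. Assumes a small scalar-output network for clarity.
import torch
import torch.nn as nn

def empirical_ntk(model, xs):
    """Return the (n x n) NTK Gram matrix at the model's current weights."""
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for x in xs:                                   # one sample at a time
        out = model(x.unsqueeze(0)).squeeze()      # scalar output f(x)
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    jac = torch.stack(rows)                        # (n, num_params) Jacobian
    return jac @ jac.T

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
xs = torch.randn(8, 10)
ntk_init = empirical_ntk(net, xs)                  # kernel at initialization
# ...train for a while, then recompute to track the kernel's evolution:
# ntk_t = empirical_ntk(net, xs)
# velocity = (ntk_t - ntk_init).norm()             # how far the NTK has moved
```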
The effect of the number of parameters and the number of local feature patches on loss landscapes in distributed quantum neural networks
Quantum neural networks hold promise for tackling computationally challenging tasks that are intractable for classical computers. However, their practical application is hindered by significant optimization challenges, arising from complex loss landscapes characterized by barren plateaus and numerous local minima. These problems become more severe as the number of parameters or qubits increases, hampering effective training. To mitigate these optimization challenges, particularly for quantum machine learning applied to classical data, we employ an approach of distributing overlapping local patches across multiple quantum neural networks, processing each patch with an independent quantum neural network, and aggregating their outputs for prediction. In this study, we investigate how the number of parameters and patches affects the loss landscape geometry of this distributed quantum neural network architecture via Hessian analysis and loss landscape visualization. Our results confirm that increasing the number of parameters tends to lead to deeper and sharper loss landscapes. Crucially, we demonstrate that increasing the number of patches significantly reduces the largest Hessian eigenvalue at minima. This finding suggests that our distributed patch approach acts as a form of implicit regularization, promoting optimization stability and potentially enhancing generalization. Our study provides valuable insights into optimization challenges and highlights that the distributed patch approach is a promising strategy for developing more trainable and practical quantum machine learning models for classical data tasks.
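The abstract's key sharpness measure, the largest Hessian eigenvalue at a minimum, can be estimated without ever forming the Hessian. Below is a hedged sketch (PyTorch assumed; not the authors' code) using power iteration on Hessian-vector products; comparing this quantity across models trained with different patch counts is how the claimed eigenvalue reduction would be checked.

```python
# Hedged sketch: estimate the largest Hessian eigenvalue of the loss via
# power iteration on Hessian-vector products (Pearlmutter's trick), so
# the full Hessian is never materialized.
import torch

def top_hessian_eigenvalue(loss_fn, params, iters=50):
    """loss_fn: zero-argument closure returning the scalar training loss."""
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]                      # random unit vector
    eig = 0.0
    for _ in range(iters):
        grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params)       # Hessian @ v
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()  # Rayleigh quotient
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]                 # renormalize for next step
    return eig
```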
Review for NeurIPS paper: Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel
Additional Feedback: Minor issues. *Visualization method of Figure 1: I am not sure how the authors produced this figure. Is it based on a PCA of the trajectories? It is also unclear why straight lines yield these trajectories; it appears to be just linear regression with the Taylorized model (2). More technically, when a data-dependent NTK is used in a linearized model, the positive definiteness of this NTK is non-trivial and the equivalence to kernel regression becomes unclear.
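The reviewer's positive-definiteness concern can be made precise: on a finite batch the empirical NTK Gram matrix is J @ J.T for the per-sample Jacobian J, hence positive semi-definite by construction, but kernel regression additionally needs it to be strictly positive definite. A hypothetical check, reusing the empirical_ntk helper sketched earlier:

```python
# Hypothetical check of the reviewer's point: J @ J.T is always PSD, but
# kernel regression requires its smallest eigenvalue to stay bounded
# away from zero.
import torch

def min_ntk_eigenvalue(model, xs):
    gram = empirical_ntk(model, xs)                # symmetric PSD (n x n)
    return torch.linalg.eigvalsh(gram)[0].item()   # smallest eigenvalue

# A value near zero means the per-sample gradients are (close to)
# linearly dependent, so the kernel is singular and the equivalence of
# the linearized model to kernel regression breaks down.
```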
Meta-review for NeurIPS paper: Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel
The reviews for this paper were overall positive. The paper presents an empirical inquiry into the loss landscape of the data-dependent neural tangent kernel. The authors examine the dynamics of the kernel and of the loss landscape, comparing the learned kernel to the corresponding neural networks. The reviewers appreciated the evaluation of the so-called 'parent-child spawning' phenomenon and the approximation accuracy of data-dependent neural tangent kernels. The authors made a laudable effort to call into question common preconceptions about neural tangent kernels through an extensive set of numerical experiments, leading to interesting empirical observations.
SoK: On Finding Common Ground in Loss Landscapes Using Deep Model Merging Techniques
Khan, Arham, Nief, Todd, Hudson, Nathaniel, Sakarvadia, Mansi, Grzenda, Daniel, Ajith, Aswathy, Pettyjohn, Jordan, Chard, Kyle, Foster, Ian
Understanding neural networks is crucial to creating reliable and trustworthy deep learning models. Most contemporary research in interpretability analyzes just one model at a time via causal intervention or activation analysis. Yet despite successes, these methods leave significant gaps in our understanding of the training behaviors of neural networks, how their inner representations emerge, and how we can predictably associate model components with task-specific behaviors. Seeking new insights from work in related fields, here we survey the literature on model merging, a field that aims to combine the abilities of various neural networks by merging their parameters and identifying task-specific model components in the process. We analyze the model merging literature through the lens of loss landscape geometry, an approach that enables us to connect observations from empirical studies on interpretability, security, model merging, and loss landscape analysis to phenomena that govern neural network training and the emergence of their inner representations. To systematize knowledge in this area, we present a novel taxonomy of model merging techniques organized by their core algorithmic principles. Additionally, we distill repeated empirical observations from the literature in these fields into characterizations of four major aspects of loss landscape geometry: mode convexity, determinism, directedness, and connectivity. We argue that by improving our understanding of the principles underlying model merging and loss landscape geometry, this work contributes to the goal of ensuring secure and trustworthy machine learning in practice.
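As a concrete anchor for the taxonomy, the simplest technique in this literature is uniform weight averaging of same-architecture models; the sketch below (PyTorch assumed; a hypothetical helper, not any specific paper's method) shows it, while published methods add permutation alignment, task arithmetic, and other refinements.

```python
# Minimal sketch of the simplest merging technique: uniform weight
# averaging of models that share one architecture ("model soup" style).
import copy
import torch

def average_merge(models):
    """Merge same-architecture models by averaging their state dicts."""
    merged = copy.deepcopy(models[0])
    state = merged.state_dict()
    for key in state:
        if state[key].dtype.is_floating_point:     # skip integer buffers
            state[key] = torch.stack(
                [m.state_dict()[key] for m in models]
            ).mean(dim=0)
    merged.load_state_dict(state)
    return merged
```

Naive averaging like this only succeeds when the endpoints lie in a shared linearly connected low-loss basin, which is precisely the mode convexity and connectivity behavior the survey characterizes.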
Understanding Adversarial Robustness Through Loss Landscape Geometries
Prabhu, Vinay Uday, Yap, Dian Ang, Xu, Joyce, Whaley, John
The pursuit of explaining and improving generalization in deep learning has elicited efforts both in regularization techniques and in visualization of the loss surface geometry. The latter is related to the intuition, prevalent in the community, that flatter local optima lead to lower generalization error. In this paper, we harness the state-of-the-art "filter normalization" technique of loss-surface visualization to qualitatively understand the consequences of using adversarial training data augmentation as the explicit regularization technique of choice. Much to our surprise, we discover that this oft-deployed adversarial augmentation technique does not actually result in "flatter" loss landscapes, which calls for rethinking generalization under adversarial training and the relationship between generalization and loss landscape geometry.
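For reference, the "filter normalization" technique harnessed here (after Li et al., "Visualizing the Loss Landscape of Neural Nets", 2018) rescales a random direction filter-by-filter to match the trained weights' filter norms before plotting loss along it. A minimal sketch, assuming PyTorch and a loss closure, not the authors' exact code:

```python
# Minimal sketch of filter-normalized loss-surface visualization: draw a
# random direction, rescale each of its filters to the norm of the
# corresponding trained filter, then evaluate the loss along that line.
import torch

def filter_normalized_direction(model):
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() > 1:                            # conv/linear: per-filter
            for d_f, p_f in zip(d, p):
                d_f.mul_(p_f.norm() / (d_f.norm() + 1e-10))
        else:                                      # biases, norms: whole tensor
            d.mul_(p.norm() / (d.norm() + 1e-10))
        direction.append(d)
    return direction

def loss_along(model, loss_fn, direction, alphas):
    """loss_fn: closure mapping the model to a scalar loss."""
    base = [p.detach().clone() for p in model.parameters()]
    losses = []
    with torch.no_grad():
        for a in alphas:                           # e.g. torch.linspace(-1, 1, 51)
            for p, b, d in zip(model.parameters(), base, direction):
                p.copy_(b + a * d)
            losses.append(loss_fn(model).item())
        for p, b in zip(model.parameters(), base): # restore trained weights
            p.copy_(b)
    return losses
```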